Abstract



A Monte Carlo study was conducted to compare the performance 
of three statistical indices of test item bias in small 
samples of examinees. The statistical indices compared were 
the Delta method, the Mantel-Haenszel method, and the 
Standardization method. Sample sizes of 50, 100, and 200 
were examined. One thousand samples of each size were drawn 
with replacement from each of three archival data files from 
teacher subject area tests. Each sample was drawn so that 
80% of the examinees were sampled from a reference group and 
20% from a focal group. Item bias was experimentally 
controlled in the study, and the effectiveness of the 
indices was evaluated as the proportion of such biased items 
appropriately identified. 

Screening Items for Bias: An Empirical Comparison of the 
Performance of Three Indices in Small Samples of Examinees 

The evaluation of test items for bias is an important 
component in the analysis of examination results. Several 
statistical indices have been proposed as useful screening 
tools for test item bias (Angoff & Ford, 1973; Dorans & 
Kulik, 1986; Holland & Thayer, 1988). Practical research 
related to the utility of such indices and comparisons among 
the indices have been presented by Hills (1989), Perlman, 
Bezruczko, Junker, Reynolds, Rice, and Schulz (1988), and 
Shepard, Camilli, and Williams (1985). 

A practical issue which has not been adequately researched 
is the utility of statistical screening indices when the 
number of examinees is small. The purposes of this research 
were (a) to appraise the sensitivity of three indices of 
item bias in small sample situations, (b) to estimate the 
stability of the indices, and (c) to provide recommendations 
on the appropriate use of statistical indices for item bias 
screening in small sample testing programs. 

Statistical Indices of Item Bias 

Three indices of item bias were compared in this study: 
Angoff's Delta method (Angoff & Ford, 1973), the Mantel- 
Haenszel method (Holland & Thayer, 1988), and the 
Standardization method (Dorans & Kulik, 1986). Each index is 
briefly described below. More detailed treatments of these 
statistical indices are provided in Scheuneman and Bleistein 
(1989). 



Angoff's Delta Method 

The Delta method is based on differences in transformed item 
difficulty indices between the reference group (e.g., white 
examinees) and the focal group of examinees (e.g., black 
examinees). Item difficulties are computed separately for 
the two groups. To convert the difficulty indices to an 
equal interval scale, the difficulty indices are transformed 
to inverse standard normal deviates (z-scores). By 
convention, the Delta values are obtained by linearly 
transforming the normal deviates to a scale yielding a mean 
of 13 and a standard deviation of 4 (Delta = 4z + 13). 

A scatterplot of the Delta values for the two groups 
produces an ellipse and an item is flagged as potentially 
biased when the coordinates of the item in such a 
scatterplot are distant from the majority of item points in 
the bivariate space. Specifically, the perpendicular 
distance of each item from the major axis of the ellipse is 
calculated, and this distance measure is used as an index of 
potential bias. 
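
As a rough illustration of the computation described above, the following Python sketch transforms group p-values to the Delta scale and computes each item's signed perpendicular distance from the major axis of the scatterplot. The function names are hypothetical, and two details are assumptions rather than statements from this paper: the normal deviate is taken for the proportion answering incorrectly (so that harder items receive larger Delta values), and the major axis is fitted with the usual principal-axis formula.

```python
from math import sqrt
from statistics import NormalDist, mean, pvariance


def delta_values(p_values):
    """Transform proportions correct (strictly between 0 and 1) to Delta = 4z + 13."""
    nd = NormalDist()
    # Assumption: z is the normal deviate of the proportion incorrect,
    # so that harder items receive larger Delta values.
    return [13.0 + 4.0 * nd.inv_cdf(1.0 - p) for p in p_values]


def major_axis_distances(x, y):
    """Signed perpendicular distance of each (x, y) point from the major axis.

    x holds the reference-group Deltas and y the focal-group Deltas.
    Negative values mark items that are relatively harder for the y group.
    """
    mx, my = mean(x), mean(y)
    sxx = pvariance(x, mx)
    syy = pvariance(y, my)
    sxy = mean((a - mx) * (b - my) for a, b in zip(x, y))  # must be nonzero
    slope = ((syy - sxx) + sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    intercept = my - slope * mx
    return [(slope * a - b + intercept) / sqrt(slope ** 2 + 1) for a, b in zip(x, y)]
```

Under these assumptions, items would be ranked by the absolute value of the returned distance, and the most distant items flagged for review.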

Mantel-Haenszel Method 

The Mantel-Haenszel method provides a comparison of the 
performance of examinees in the reference and focal groups, 
while matching examinees on total test score. The method is 
best conceptualized by considering a series of 2 X 2 
contingency tables, each table representing examinees who 
have received the same total score on the test: 

Table for Examinees with Total Score = s 

                         Item Performance
Examinee Group       Right      Wrong      Total

Focal (f)            Rfs        Wfs        Nfs
Reference (r)        Rrs        Wrs        Nrs
Total                Rts        Wts        Nts



The index of potential item bias is given by the weighted 
sum of odds ratios across the set of contingency tables: 

\[
\alpha_{MH} = \frac{\sum_{s} R_{rs} W_{fs} / N_{ts}}{\sum_{s} R_{fs} W_{rs} / N_{ts}}
\]

With small samples of examinees, matching by total score is 
not feasible (such matching yields contingency tables in 
which most cells are empty). The procedure was modified for 
the small sample situation in this research by dividing the 
examinees into only two strata (high and low total scores). 
The point of demarcation between strata was computed as the 
median total score of the sample of focal group members. 
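
The two-stratum modification can be sketched in Python as follows. This is a minimal illustration rather than the program used in the study: the record layout, the function names, and the rule that scores at or below the focal-group median fall in the low stratum are all assumptions.

```python
from statistics import median


def two_strata(records):
    """Return a function assigning a total score to the 'low' or 'high' stratum.

    Assumption: scores at or below the focal-group median count as 'low'.
    """
    cut = median(total for group, total, _ in records if group == "f")
    return lambda total: "high" if total > cut else "low"


def mh_odds_ratio(records, stratum_of):
    """Mantel-Haenszel common odds ratio for one item.

    records: list of (group, total_score, correct), with group 'r' or 'f'.
    """
    tables = {}
    for group, total, correct in records:
        cell = tables.setdefault(stratum_of(total), {"Rr": 0, "Wr": 0, "Rf": 0, "Wf": 0})
        cell[("R" if correct else "W") + group] += 1
    num = den = 0.0
    for c in tables.values():
        n_ts = sum(c.values())  # all examinees in this stratum
        num += c["Rr"] * c["Wf"] / n_ts
        den += c["Rf"] * c["Wr"] / n_ts
    return num / den  # raises ZeroDivisionError when the denominator is zero
```

A value above 1 indicates that, within matched strata, the odds of a correct response are higher for the reference group. With only ten focal examinees a zero denominator is possible, so a caller would need to decide how to treat that case.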

The Standardization Method 

The Standardization method provides an index based on the 
weighted difference in difficulty indices for the reference 
and focal groups, while matching examinees on the basis of 
total test score: 

\[
D_{std} = \sum_{s} N_{fs} P_{fs} - \sum_{s} N_{fs} P_{rs}
\]

where P_{fs} and P_{rs} are the item difficulty indices for the 
focal and reference group examinees at total score level s. 

As with the modification described above for the Mantel- 
Haenszel procedure, the examinees in each sample were 
divided into only two matched groups (high and low total 
scores), based on the median total score of the focal group 
members in each sample. 
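
A minimal Python sketch of the two-stratum Standardization index follows. The function name and record layout are hypothetical. The sketch divides the weighted difference by the total number of focal examinees, as in the usual formulation of the standardized p-difference; that normalization, the placement of the median score in the low stratum, and the skipping of strata that lack one of the groups are assumptions.

```python
from statistics import median


def standardized_p_difference(records):
    """Focal-weighted difference in proportion correct (focal minus reference).

    records: list of (group, total_score, correct), with group 'r' or 'f'.
    """
    cut = median(total for group, total, _ in records if group == "f")
    counts = {}  # stratum -> group -> [number right, number of examinees]
    for group, total, correct in records:
        stratum = "high" if total > cut else "low"
        cell = counts.setdefault(stratum, {"r": [0, 0], "f": [0, 0]})
        cell[group][0] += int(bool(correct))
        cell[group][1] += 1
    weighted_diff = 0.0
    n_focal = 0
    for cell in counts.values():
        (right_r, n_r), (right_f, n_f) = cell["r"], cell["f"]
        if n_r == 0 or n_f == 0:
            continue  # assumption: strata lacking either group are skipped
        weighted_diff += n_f * (right_f / n_f - right_r / n_r)
        n_focal += n_f
    return weighted_diff / n_focal
```

Negative values indicate an item that is harder for focal group examinees than for reference group examinees of comparable total score.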

Applications of Item Bias Indices to Small Samples 

Dorans (1989) and Shepard et al. (1985) recommended 
additional research on applications of item bias indices in 
small samples of examinees. In an attempt to address the 
small sample issue, Shepard et al. (1985) examined 
statistical indices with focal group sizes of 300 examinees. 
However, even this number is considerably larger than the 
size of the focal group examinee samples in many small 
testing programs. 

Hills (1989) stated that Angoff's Delta method is commonly 
used with small samples. The Delta method has been 
recommended as the best, and sometimes as the only, choice 
when screening examination data from small samples 
(Scheuneman & Bleistein, 1989). In an early study of the 
Delta method, Angoff and Ford (1973) evaluated 10 samples 
with numbers of cases ranging from 125 to 340. 
However, problems with the Delta method and evidence of 
spurious bias identification led Shepard et al. (1985) to 
recommend that the unmodified Delta method not be used at 
all, even under small sample constraints. Harris and Kolen 
(1989) used bootstrap methodology to examine the stability 
of the Delta method across 250 resamplings. The authors 
found only moderate stability for the Delta method even with 
sample sizes of 400 in both the reference and focal groups. 
Similarly, in a comparison of four item bias screening 
statistics, Perlman et al. (1988) found the Delta method to 
have the lowest correlation with other statistics proposed 
for bias detection. 

In contrast to the Delta method, the Standardization method 
is generally recommended for use only with relatively large 
samples of examinees (Dorans, 1989; Scheuneman & Bleistein, 
1989). Dorans and Kulik (1986) suggested that confusing 
results in an item bias study may have been attributable to 
an "inadequate" number of focal group members, when the size 
of the focal group was 2,616. Hills (1989) stated that the 
Standardization method is normally used in testing programs 
with 10,000 or more examinees, and he recommended a minimum 
of 5,000 examinees in order to appropriately apply this 
technique. However, the Educational Testing Service (ETS) 
routinely uses the Standardization method to examine the SAT 
for item bias. For pre-trial items, ETS requires only 100 
examinees in the focal group and 500 total examinees; for 
final form items, 200 examinees are required in the focal 
group and 600 examinees in all (Schmitt, personal 
communication, July 25, 1991). 

The Mantel-Haenszel technique has been suggested as a more 
appropriate method for use with relatively small samples 
(Buhr & Legg, 1990; Dorans, 1989; Holland & Thayer, 1986). 
Hills (1989) suggested that the Mantel-Haenszel method could 
be used with as few as 100 examinees in each group. DeMauro 
(1990) reiterated ETS' requirement of 200 focal group 
examinees and 600 total examinees. However, Perlman et al. 
(1988) found relatively low reliability of item bias 
identification across 30 replications for both the Mantel- 
Haenszel and the Delta methods, when the samples consisted 
of 200 to 300 examinees in each group. The reported 
reliability indices ranged from .38 to .58 for the Mantel- 
Haenszel, and from .39 to .53 for the Delta method. 

Additional research is needed to assess the relative 
performance of item bias detection techniques applied to 
small samples of examinees. The information that is 
currently available to administrators of testing programs 
that service limited numbers of examinees is inadequate for 
making informed decisions about item bias screening 
techniques. To this end, the Monte Carlo study reported in 
this paper was undertaken. The relative effectiveness in 
small samples of the Delta method and the modified versions 
of the Mantel-Haenszel and Standardization methods described 
above was evaluated. The remainder of this paper describes 
the method, results, and implications of this study. 

Method 

Pseudo-Populations Examined 

The data on which this research was conducted were random 
samples drawn with replacement from existing archival test 
data files. These files represented pseudo-populations from 
which samples were drawn. Data files from three teacher 
subject area tests were used: Elementary Education (1-6), 
Early Childhood Education (K-3), and Specific Learning 
Disabilities (K-12). These pseudo-populations will be 
referred to as test forms 1, 2, and 3, respectively, in the 
remainder of this report. The numbers of multiple-choice 
items on the test forms were 140, 141, and 118 for forms 1, 
2, and 3, respectively. 

Induction of Item Bias 

Within each pseudo-population, item bias was experimentally 
controlled using the following procedure. Each pseudo- 
population was randomly divided into two files, one with 80% 
of the examinee records (representing the reference group) 
and the other with 20% of the examinee records (representing 
the focal group). This random division of records 
provides data files in which differences in item difficulty 
occur only by chance. To verify the equivalence of item 
difficulty in the two groups, the p-values were compared 
prior to the induction of bias. 

In each pseudo-population, nine items were selected for bias 
induction. Items were selected to represent all combinations 
of high, moderate, and low values of both difficulty and 
discrimination (i.e., one item was selected at low values of 
both difficulty and discrimination, a second item was 
selected at a low value of difficulty and a moderate value 
of discrimination, etc.). Low values of item difficulty were 
below 0.30, moderate values were between 0.30 and 0.70, and 
high values were greater than 0.70. Low values of item 
discrimination were below 0.20, moderate values were between 
0.20 and 0.35 and high values were greater than 0.35. 
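
The nine-cell selection grid implied by these cut points can be written compactly. The Python sketch below uses hypothetical function names; treating values that fall exactly on a boundary as moderate is an assumption.

```python
def level(value, low_cut, high_cut):
    """Classify a value as 'low', 'moderate', or 'high' given two cut points."""
    if value < low_cut:
        return "low"
    if value > high_cut:
        return "high"
    return "moderate"  # assumption: boundary values count as moderate


def item_cell(p_value, discrimination):
    """Return the (difficulty level, discrimination level) cell for one item."""
    return (level(p_value, 0.30, 0.70), level(discrimination, 0.20, 0.35))
```

One item per cell gives the nine items selected for bias induction in each pseudo-population.
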

Bias was induced in the items by, first, stratifying the 
reference and focal groups on the basis of total test score 
and, second, modifying the item responses of randomly 
selected records within each stratum of each of the 
populations. The induced bias was designed to yield a 
difference in item difficulty favoring the reference group. 
For the randomly selected records, correct responses in the 
focal group were changed to incorrect, and incorrect responses 
in the reference group were changed to correct. The stratification of the 
data files by total test score was used to maintain the item 
discrimination indices at the level obtained before the bias 
induction. Three levels of bias (magnitude of difference in 
item p-value between the reference and focal pseudo- 
populations) were examined in this research: 0.1, 0.2, and 
0.3. The effects of each level of bias were examined in 
separate replications of the study. That is, in the first 
execution, all nine items were induced to a 0.1 level of 
bias; in the second, all nine items were induced to a 0.2 
level; and in the third execution, all nine items were 
induced to a 0.3 level. 
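
The induction step can be sketched in Python as shown below. This is an illustration under stated assumptions, not the procedure actually programmed for the study: the paper does not report how the target difference was divided between the two groups, so the sketch assumes half of the shift is applied to each group, and the function names and data layout are hypothetical.

```python
import random


def shift_item(responses, strata, group_ids, group, from_value, fraction, rng):
    """Flip `from_value` responses of one group within each stratum.

    responses : list of 0/1 item scores, one per examinee
    strata    : list of stratum labels (e.g., total-score bands)
    group_ids : list of 'r'/'f' labels
    fraction  : share of the group's records in each stratum to flip
    """
    out = list(responses)
    for s in sorted(set(strata)):
        members = [i for i in range(len(out)) if strata[i] == s and group_ids[i] == group]
        eligible = [i for i in members if out[i] == from_value]
        n_flip = min(len(eligible), round(fraction * len(members)))
        for i in rng.sample(eligible, n_flip):
            out[i] = 1 - from_value
    return out


def induce_bias(responses, strata, group_ids, bias, seed=0):
    """Lower the focal p-value and raise the reference p-value for one item."""
    rng = random.Random(seed)
    # Assumption: half of the target p-value difference comes from each group.
    out = shift_item(responses, strata, group_ids, "f", 1, bias / 2, rng)
    return shift_item(out, strata, group_ids, "r", 0, bias / 2, rng)
```

Because the same fraction is flipped in every total-score stratum, the relationship between item score and total score changes little, which is the rationale the text gives for stratifying.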

After the item bias was induced in the pseudo-population 
files, random samples were drawn with replacement, and the 
three bias detection indices were computed for each item in 
each sample. One thousand samples were drawn of size 50, 
100, and 200. The samples were drawn such that 80% of the 
records in each sample were obtained from the reference 
group and 20% of the records in each sample were obtained 
from the focal group. Thus, for example, a sample size of 
fifty represents 40 observations from the reference group 
and 10 observations from the focal group. 
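
The sampling design can be sketched in a few lines of Python. The function name is hypothetical, and the use of Python's `random.choices` for drawing with replacement is simply one convenient way to express the design.

```python
import random


def draw_sample(reference, focal, n, rng):
    """Draw n records with replacement: 80% reference, 20% focal."""
    n_focal = n * 20 // 100
    n_reference = n - n_focal
    return rng.choices(reference, k=n_reference), rng.choices(focal, k=n_focal)


rng = random.Random(1991)  # arbitrary seed, for reproducibility only
ref_sample, foc_sample = draw_sample(list(range(800)), list(range(200)), 50, rng)
print(len(ref_sample), len(foc_sample))  # 40 10
```

Repeating the draw 1000 times for each sample size and computing the three indices on every draw gives the replications summarized in the Results section.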

Results 

The overall descriptive statistics for the three bias 
indicators are presented in Table 1. For each sample size 
and level of induced bias, the mean and standard deviation 
of each bias indicator, computed over test items and the 
1000 replications, are presented. These descriptive 
statistics were computed separately for the bias-induced 
items and the non-induced items within each pseudo- 
population. 



Insert Table 1 about here 



To examine the sensitivity of each statistic to biased 
items, effect sizes were calculated. The effect size of each 
statistic is the ratio of the difference between the mean 
values of the statistic for bias-induced items and unbiased 
items to the empirical estimate of the standard error of the 
statistic. Thus, the effect size is given by 

\[
\text{Effect Size} = \frac{\hat{\mu}_b - \hat{\mu}_u}{\hat{\sigma}_u}
\]

where 

\hat{\mu}_b = mean value of the statistic for bias-induced items, 
\hat{\mu}_u = mean value of the statistic for unbiased items, and 
\hat{\sigma}_u = standard error of the statistic for unbiased items. 
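
As a check on the definition, the effect size can be recomputed from the means and standard deviations in Table 1. The short Python sketch below (the function name is hypothetical) uses the Table 1 entries for exam form 1, samples of size 50, and a bias level of 0.1. Because the tabled values are rounded to two decimals, a recomputed effect size will not always match the tabled effect size exactly.

```python
def effect_size(mean_biased, mean_unbiased, sd_unbiased):
    """Standardized difference between bias-induced and unbiased items."""
    return (mean_biased - mean_unbiased) / sd_unbiased


# Exam form 1, n = 50, bias level 0.1 (values from Table 1)
print(round(effect_size(2.39, 1.36, 1.70), 2))  # Mantel-Haenszel: 0.61
print(round(effect_size(1.20, 1.34, 1.00), 2))  # Delta: -0.14
```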

The effect sizes for each statistic under each of the 
conditions examined in the study are presented in the rightmost 
columns of Table 1. For all three statistics, the effect size 
increases as (a) sample size increases, and (b) the 
magnitude of the bias increases. 

In comparing the effect sizes across the three statistical 
indices, the effect size of the Mantel-Haenszel chi-square 
statistic is consistently the largest, and the effect size 
of the Delta statistic is consistently the smallest. In 
samples of size 50, the best performance by the Delta method 
was with test form 2 and a bias effect of 0.3, which yielded 
an effect size of 0.70 (i.e., the mean value of the 
statistic for biased items was seven-tenths of a standard 
error higher than the mean of the statistic for the unbiased 
items). In contrast, for samples of size 50 from test form 2 
and a bias effect of 0.3, the Standardized difficulty index 
yielded an effect size of 1.71, and the Mantel-Haenszel 
yielded an effect size of 3.68. 

Across all of the conditions examined in the experiment, the 
best performance by the Delta method was obtained with 
samples of size 200 from exam form 1 with a bias effect of 
0.3. In this condition the effect size of Delta was a 
substantial 3.29. However, under this condition the effect 
sizes for the Standardized difficulty index and the Mantel- 
Haenszel were 5.00 and 8.27, respectively. 

Proportion of Correct Identification of Biased Items 

For each of the 1000 samples evaluated in each condition 
examined (each combination of sample size, examination form, 
and level of bias), the nine items with the most extreme 
values of each statistic were flagged as potentially biased. 
An intuitive index of the overall success of each bias 
detection statistic is the proportion of these flagged items 
(bias-declared items) that were the bias-induced items in 
the pseudo-populations. Optimal performance by a screening 
statistic would result in this index reaching a value of one 
(when the nine bias-declared items in every sample are the 
nine bias-induced items). The overall success rate for each 
screening statistic in biased item detection is presented in 
Table 2. 
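
The success index for a single sample can be sketched in Python as follows. The function name is hypothetical, and ranking items by the absolute value of the statistic is an assumption about what "most extreme" means for each index.

```python
def detection_rate(statistic_by_item, induced_items, n_flagged=9):
    """Proportion of the flagged (most extreme) items that were bias-induced.

    statistic_by_item : dict mapping item id -> value of the bias statistic
    induced_items     : set of item ids in which bias was induced
    """
    ranked = sorted(statistic_by_item,
                    key=lambda item: abs(statistic_by_item[item]),
                    reverse=True)
    flagged = ranked[:n_flagged]
    return sum(item in induced_items for item in flagged) / n_flagged
```

Averaging this proportion over the 1000 samples drawn in a condition would give one entry of the kind reported in Table 2.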



Insert Table 2 about here 



The results of this analysis parallel those obtained from 
the examination of the effect sizes of the indices. The best 
performance in biased item detection was obtained from the 
Mantel-Haenszel and the worst performance from the Delta 
method. Note that in the examination of the proportion of 
items correctly identified as biased, the difference between 
the performance of the Mantel-Haenszel and the performance 
of the Standardized difficulty method appears to be trivial. 
In 20 of the 27 conditions examined in the 
experiment, the difference between the success rates was 5% 
or less. However, under only one of the conditions examined 
(test form 3, samples of size 200, and a bias effect of 0.3) 
did the Standardization technique outperform the Mantel- 
Haenszel (73% vs. 72% success). In the analysis of the 
effect sizes of the statistics, the difference between these 
two statistics was notably more pronounced. 

For the Delta method, fewer than one-half of the bias-declared 
items were bias-induced items in 25 of the 27 
conditions examined. Only with samples of size 200 from exam 
forms 1 and 2 and a bias effect of 0.3 did this technique 
exceed 50% success. The Standardized difficulty index 
exceeded 50% success in only seven of the 27 conditions 
examined, and the Mantel-Haenszel exceeded 50% correct in 
nine of the 27 conditions. 

To further investigate the success of each statistic at 
identifying biased items, the success rate of each index for 
each of the bias-induced test items was examined. Summary 
data on the overall success rates (collapsing across sample 
sizes) are presented in Table 3. 



Insert Table 3 about here 



In comparing the performance of the indicators presented in 
Table 3, the similarity between the Mantel-Haenszel and the 
Standardization methods is evident, as is the superiority of 
either of these methods to the Delta technique. The Delta 
technique shows its greatest effectiveness at identifying 
bias in items that have high p-values. This effect is 
attributable to the inverse-normal transformation of the 
group p-values used in this technique. This transformation 
accentuates differences in item difficulties at the extreme 
values (p-values near zero or one). For the three items with 
high p-values on exam form 2, the Delta method outperformed 
the Standardization technique consistently, as it did on one 
of the high p-value items on test form 1. However, the Delta 
method did not outperform the Mantel-Haenszel on any of the 
items examined. 

The similarity in the performance of the Mantel-Haenszel and 
the Standardization method is evident in the data presented 
in Table 3. In 48 of the 81 conditions summarized in this 
table (59% of the conditions), the Mantel-Haenszel statistic 
was more effective in bias identification than the Standardization 
method. However, the difference in effectiveness is typically 
quite small. Of the 33 experimental conditions in which the 
Standardization technique outperformed the Mantel-Haenszel, 
25 involved the detection of bias in items of 
moderate difficulty. 

Discussion and Recommendations 

With samples of size fifty, none of the bias screening 
statistics examined reached an overall fifty percent success 
rate at bias detection. The best performance with these 
small samples was achieved by the Mantel-Haenszel method. 
With a bias level of 0.3, the bias-induced items were 
correctly flagged by this statistic 47% of the time 
for exam form 1, 45% of the time for exam form 2, and 37% of 
the time for exam form 3. The corresponding rates for the 
Standardization method were 44%, 33%, and 36%. The Delta 
method showed much lower rates of successful bias detection 
(18%, 19%, and 11% for the three examinations in samples of 
size 50 and a bias level of 0.3). 

When the sample size increased to 100, both the Mantel- 
Haenszel and the Standardization methods exceeded 50% 
success at bias level 0.3. At this sample size, the Mantel- 
Haenszel statistic reached or exceeded 40% success for the 
bias level of 0.2, while the Standardization method remained 
between 30% and 40% success at this bias level. The 
performance of the Delta method was notably lower than 
these levels. 



Small Sample Item Bias Screening 




Finally, with samples of size 200, the Mantel-Haenszel's 
success rate exceeded 70% with bias levels of 0.3, and 
exceeded 50% with bias levels of 0.2, but remained below 30% 
for a bias level of 0.1. The Standardization method's 
success rate also exceeded 70% with bias levels of 0.3, but 
dropped below 50% success with a bias level of 0.2, and 
dropped below 20% with a bias level of 0.1. At this sample 
size and a bias level of 0.3, the Delta method's success 
rate exceeded 50% on two of the three examinations. 

Although the success rates with small samples seem low, the 
probability of an item being flagged as biased in this 
experiment (i.e., being one of the nine items with the most 
extreme value of the statistic), if items are responding 
randomly to the statistic, is just over 6% for a test with 
140 items, and just under 8% for a test with 118 items. 
These chance rates were clearly exceeded even when only ten 
focal group examinees were included in the sample (samples 
of size 50). 
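
The chance rates quoted above follow directly from the flagging rule: if the nine flagged items were chosen at random, each item would be flagged with probability nine divided by the number of items on the form. A short Python check (the function name is hypothetical):

```python
def chance_flag_rate(n_items, n_flagged=9):
    """Probability that a given item is flagged if flagging were random."""
    return n_flagged / n_items


for n_items in (140, 141, 118):  # test forms 1, 2, and 3
    print(n_items, round(chance_flag_rate(n_items), 3))
# 140 0.064
# 141 0.064
# 118 0.076
```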

In comparing the three statistical indices, the Mantel- 
Haenszel is the best performer. Although its performance 
advantage over the Standardization method is slight, the 
advantage is consistent over the conditions examined in this 
study. The only exception to this advantage is the detection 
of bias in items of moderate difficulty level, where the 
Standardization method outperformed the Mantel-Haenszel. The 
most striking outcome of this research is the clear 
superiority across all conditions examined of both the 
Mantel-Haenszel and the Standardization method to the Delta 
method. Previous research has shown that the Delta method 
correlates poorly with other bias detection indices (Buhr & 
Legg, 1990; Perlman et al., 1988). However, because these 
studies did not use experimentally-controlled, induced item 
bias, no criterion was available for evaluating the accuracy 
of the various bias detection methods under investigation. 
This study clearly shows that not only does the Delta method 
correlate poorly with the Mantel-Haenszel and the 
Standardization methods, it is also much less accurate in 
the identification of biased items. 

Shepard et al. (1985) point out two classes of weaknesses in 
much of the research which has been done on item bias 
indices. In the first, simulated data are used so that 
definitive knowledge about which items are biased is 
available. However, the simulated data in these studies may 
not accurately resemble the performance of real examinees. 
In the second type of study, real data are used; in these 
studies, however, the researcher has no prior knowledge of 
which items (if any) should be classified as biased. The 
present study has attempted to overcome both of these 
methodological weaknesses by retaining the advantages of 
using real data, while inducing item bias in order to gain 
the benefit of accurate information regarding the 
correctness of the bias indices. 

In summary, previous research suggesting that item bias 
indices such as the Mantel-Haenszel and Standardization 
methods should only be applied to large samples may have 
been overly conservative. The results of this study support 
the use of statistical screening for item bias, even with 
samples as small as 50 examinees, and with only ten focal 
group members in each sample. The proportion of items 
correctly identified as biased by all indices examined 
increased with both sample size and the magnitude of the 
bias. The statistical index evidencing the best performance 
at bias detection was the Mantel-Haenszel statistic, 
followed closely by the Standardization method. Both of 
these indices clearly outperformed the Delta method. 

Table 1 

Means, Standard Deviations, and Effect Sizes of Three Bias Detection Indices 

                     Mantel-Haenszel               Standardized Difficulty       Delta                         Effect Sizes
Exam  Sample  Bias   Unbiased Items Biased Items   Unbiased Items Biased Items   Unbiased Items Biased Items
Form  Size    Level  Mean     SD    Mean     SD    Mean     SD    Mean     SD    Mean     SD    Mean     SD      M-H  Stand. P  Delta

1     50      0.1    1.36   1.70    2.39   3.02    0.12   0.09    0.15   0.11    1.34   1.00    1.20   0.88    0.61      0.39  -0.14
1     50      0.2    1.34   1.66    4.05   4.67    0.12   0.09    0.21   0.14    1.35   1.01    1.51   1.04    1.63      1.04   0.16
1     50      0.3    1.29   1.62    6.58   7.44    0.12   0.09    0.29   0.15    1.35   1.00    2.00   1.26    3.26      1.93   0.64
1     100     0.1    1.16   0.89    1.94   1.34    0.08   0.06    0.12   0.09    0.90   0.70    0.93   0.64    0.87      0.62   0.04
1     100     0.2    1.12   0.85    3.29   2.49    0.08   0.06    0.20   0.11    0.90   0.70    1.41   0.80    2.56      1.78   0.72
1     100     0.3    1.09   0.84    5.88   4.83    0.08   0.06    0.29   0.11    0.92   0.70    2.08   0.93    5.73      3.24   1.65
1     200     0.1    1.06   0.53    1.76   0.75    0.06   0.04    0.11   0.07    0.59   0.47    0.81   0.51    1.34      1.06   0.47
1     200     0.2    1.03   0.51    2.90   1.37    0.06   0.04    0.19   0.08    0.60   0.47    1.44   0.62    3.70      2.95   1.78
1     200     0.3    1.00   0.50    5.13   2.83    0.06   0.05    0.29   0.08    0.6?   0.47    2.18   0.65    8.27      5.00   3.29

2     50      0.1    1.31   1.70    2.34   2.98    0.12   0.09    0.14   0.11    1.36   1.07    1.34   0.93    0.61      0.30  -0.01
2     50      0.2    1.26   1.61    4.06   5.58    0.12   0.09    0.19   0.13    1.37   1.08    1.62   1.05    1.74      0.83   0.24
2     50      0.3    1.22   1.58    7.03   8.72    0.12   0.09    0.27   0.15    1.37   1.07    2.13   1.25    3.68      1.71   0.70
2     100     0.1    1.12   0.91    1.90   1.42    0.08   0.06    0.11   0.08    0.95   0.73    1.02   0.65    0.87      0.44   0.09
2     100     0.2    1.08   0.88    3.37   2.71    0.08   0.06    0.18   0.10    0.96   0.73    1.45   0.75    2.59      1.46   0.67
2     100     0.3    1.04   0.83    6.12   4.71    0.09   0.06    0.26   0.11    0.96   0.73    2.06   0.89    6.09      2.76   1.51
2     200     0.1    1.02   0.54    1.70   0.77    0.06   0.05    0.09   0.06    0.67   0.50    0.84   0.49    1.28      0.70   0.34
2     200     0.2    0.99   0.52    2.95   1.45    0.06   0.05    0.17   0.08    0.68   0.51    1.42   0.58    3.75      2.35   1.47
2     200     0.3    0.96   0.51    5.34   3.02    0.06   0.05    0.26   0.08    0.69   0.51    2.13   0.64    8.52      4.28   2.83

3     50      0.1    1.30   1.86    2.28   2.69    0.11   0.08    0.14   0.10    1.56   1.20    1.41   0.99    0.53      0.43  -0.12
3     50      0.2    1.27   1.80    3.52   4.08    0.11   0.08    0.18   0.12    1.56   1.20    1.57   1.05    1.25      0.92   0.01
3     50      0.3    1.22   1.75    5.39   6.47    0.11   0.08    0.24   0.13    1.56   1.20    1.90   1.16    2.39      1.65   0.29
3     100     0.1    1.13   1.03    2.10   1.57    0.08   0.06    0.11   0.08    1.10   0.83    0.98   0.71    0.94      0.62  -0.14
3     100     0.2    1.08   0.95    3.33   2.43    0.03   0.06    0.17   0.09    1.11   0.83    1.28   0.83    2.36      1.51   0.20
3     100     0.3    1.06   0.59    5.28   0.51    0.08   0.06    0.24   0.10    1.13   0.84    1.74   0.97    7.14      2.68   0.73
3     200     0.1    1.04   0.63    1.92   1.04    0.06   0.04    0.10   0.06    0.82   0.58    0.78   0.49    1.41      0.88  -0.07
3     200     0.2    0.99   0.60    3.30   1.96    0.06   0.04    0.16   0.07    0.82   0.58    1.17   0.61    3.88      2.37   0.60
3     200     0.3    0.97   0.59    5.53   3.07    0.06   0.04    0.24   0.07    0.85   0.58    1.73   0.73    7.78      4.02   1.52

Note. A question mark marks a digit that is illegible in the source.

Table 2 

Proportion of Biased Items Correctly Identified by Three Bias Indices 

                           Proportion of Correctly Classified Biased Items
Exam   Sample   Bias
Form   Size     Level      M-H           Stand. P      Delta

1      50       0.1        0.16          0.13          0.04
1      50       0.2        0.30          0.26          0.09
1      50       0.3        0.47          0.44          0.18
1      100      0.1        0.20          0.18          0.06
1      100      0.2        0.45          0.40          0.18
1      100      0.3        [illegible]   [illegible]   [illegible]
1      200      0.1        0.29          0.26          0.15
1      200      0.2        0.63          0.60          0.41
1      200      0.3        [illegible]   [illegible]   [illegible]

2      50       0.1        0.15          0.12          0.05
2      50       0.2        0.29          0.21          0.10
2      50       0.3        0.45          0.38          0.19
2      100      0.1        0.19          0.14          0.07
2      100      0.2        0.42          0.33          0.18
2      100      0.3        0.64          0.57          0.32
2      200      0.1        0.26          0.19          0.13
2      200      0.2        0.57          0.49          0.32
2      200      0.3        0.79          0.76          0.51

3      50       0.1        0.15          0.14          0.04
3      50       0.2        0.26          0.23          0.07
3      50       0.3        0.37          0.36          0.11
3      100      0.1        0.21          0.18          0.05
3      100      0.2        0.40          0.34          0.12
3      100      0.3        0.55          0.54          0.21
3      200      0.1        0.27          0.22          0.06
3      200      0.2        0.53          0.48          0.17
3      200      0.3        0.72          0.73          0.31

Table 3 

Proportion of Items Correctly Identified As Biased By Three Statistical Indices 
for Each Level of Item Discrimination and Difficulty 

Exam form = 1

                                Bias = 0.1                  Bias = 0.2                  Bias = 0.3
Item            Item            Item Bias Index             Item Bias Index             Item Bias Index
Discrimination  Difficulty      M-H    Stan. P  Delta       M-H    Stan. P  Delta       M-H    Stan. P  Delta

Low             Low             0.225  0.239    0.043       0.469  0.503    0.158       0.691  0.739    0.327
Low             Middle          0.204  0.265    0.055       0.432  0.516    0.144       0.617  0.709    0.276
Low             High            0.280  0.??6    0.201       0.587  0.408    0.474       0.794  0.666    0.698
Middle          Low             0.169  0.166    0.029       0.398  0.370    0.098       0.644  0.609    0.257
Middle          Middle          0.172  0.206    0.038       0.350  0.398    0.112       0.544  0.614    0.207
Middle          High            0.260  0.175    0.152       0.531  0.423    0.386       0.730  0.642    0.598
High            Low             0.165  0.??5    0.022       0.404  0.344    0.100       0.661  0.592    0.238
High            Middle          0.202  0.204    0.052       0.419  0.413    0.165       0.622  0.618    0.295
High            High            0.282  0.161    0.157       0.550  0.406    0.383       0.754  0.645    0.623

Exam form = 2

                                Bias = 0.1                  Bias = 0.2                  Bias = 0.3
Item            Item            Item Bias Index             Item Bias Index             Item Bias Index
Discrimination  Difficulty      M-H    Stan. P  Delta       M-H    Stan. P  Delta       M-H    Stan. P  Delta

Low             Low             0.118  0.121    0.017       0.308  0.289    0.039       0.546  0.551    0.117
Low             Middle          0.230  0.266    0.060       0.450  0.501    0.152       0.655  0.712    0.296
Low             High            0.418  0.189    0.299       0.724  0.486    0.596       0.866  0.710    0.778
Middle          Low             0.279  0.261    0.029       0.520  0.485    0.093       0.724  0.7?0    0.281
Middle          Middle          0.069  0.088    0.009       0.204  0.227    0.034       0.405  0.435    0.098
Middle          High            0.177  0.060    0.090       0.492  0.248    0.330       0.738  0.516    0.594
High            Low             0.096  0.099    0.006       0.276  0.235    0.019       0.504  0.438    0.060
High            Middle          0.065  0.083    0.008       0.209  0.214    0.031       0.398  0.406    0.100
High            High            0.381  0.150    0.226       0.687  0.392    0.506       0.834  0.638    0.736

Exam form = 3

                                Bias = 0.1                  Bias = 0.2                  Bias = 0.3
Item            Item            Item Bias Index             Item Bias Index             Item Bias Index
Discrimination  Difficulty      M-H    Stan. P  Delta       M-H    Stan. P  Delta       M-H    Stan. P  Delta

Low             Low             0.239  0.071    0.041       0.397  0.196    0.104       0.405  0.339    0.182
Low             Middle          0.145  0.244    0.006       0.257  0.?02    0.010       0.445  0.592    0.027
Low             High            0.447  0.279    0.257       0.692  0.531    0.491       0.851  0.719    0.673
Middle          Low             0.161  0.148    0.012       0.355  0.313    0.029       0.584  0.572    0.091
Middle          Middle          0.169  0.260    0.005       0.322  0.443    0.022       0.500  0.641    0.041
Middle          High            0.151  0.078    0.061       0.389  0.252    0.148       0.612  0.464    0.338
High            Low             0.371  0.300    0.039       0.561  0.464    0.092       0.565  0.641    0.203
High            Middle          0.078  0.155    0.004       0.186  0.275    0.004       0.345  0.467    0.009
High            High            0.143  0.078    0.049       0.409  0.255    0.148       0.617  0.452    0.319

Note. A question mark marks a digit that is illegible in the source.